Abstract

It is anticipated that in future generations of massively parallel computer systems a significant portion of
processors may suffer from hardware or software faults, rendering large-scale computations useless. In
this project, the PI and his students address this problem from the algorithmic side, proposing resilient
frameworks that can recover from such faults and continue the solution with gappy fields, irrespective of
the origin of the fault. In addition to its robustness and resilience, the new framework generalizes previous
multiscale and multifidelity approaches in a unified parallel computational framework.


Objectives 

The general objective of this project was to develop a general CFD framework for multifidelity simulations
that targets multiscale problems as well as resilience in exascale simulations. The specific objective was to
develop fault-recovery and fault-resilient algorithms that combine approximation theory, domain decomposition,
and machine-learning-based information fusion.

Approach 

Fault-recovery  algorithm 

We employ three different types of recovery algorithms, namely (1) projective integration (temporal
estimation), (2) coKriging (spatial estimation), and (3) resimulation (spatio-temporal estimation). We briefly
introduce the concepts of the three approaches next; for details, see (S. Lee et al. 2015).

First, if numerical solutions are sufficiently smooth in time, a temporal estimation based on previously
saved data can give a highly accurate result for a missing part of the solution. To accomplish this, we
employ an equation-free/Galerkin-free projective integration. The projective integration is based on the
proper orthogonal decomposition (POD) for dimension reduction. The basic algorithm of the projective
integration consists of three stages: restriction (a dimension reduction by POD), estimation (of the
coefficients of the POD basis), and lifting (a reconstruction of the gappy field).
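
The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the function name `projective_integration`, its parameters, and in particular the choice of a polynomial fit to extrapolate the POD coefficients in time are assumptions made for this example.

```python
import numpy as np

def projective_integration(snapshots, times, t_target, n_modes=3, poly_order=2):
    """Estimate a field at t_target from saved snapshots (one column per time).

    Hypothetical helper: polynomial extrapolation of the POD coefficients is
    an assumed estimation step, chosen here only for illustration.
    """
    # Restriction: POD basis from the SVD of the mean-subtracted snapshots.
    mean = snapshots.mean(axis=1, keepdims=True)
    fluct = snapshots - mean
    U, _, _ = np.linalg.svd(fluct, full_matrices=False)
    basis = U[:, :n_modes]
    coeffs = basis.T @ fluct  # shape (n_modes, n_times)

    # Estimation: extrapolate each POD coefficient to the target time.
    est = np.array([
        np.polyval(np.polyfit(times, c, poly_order), t_target) for c in coeffs
    ])

    # Lifting: reconstruct the full field from the estimated coefficients.
    return mean[:, 0] + basis @ est

# Example: a smooth two-mode field sampled at six saved time levels.
x = np.linspace(0.0, 2.0 * np.pi, 50)
t = np.linspace(0.0, 1.0, 6)
saved = np.sin(x)[:, None] * (1.0 + t) + np.cos(x)[:, None] * t**2
u_est = projective_integration(saved, t, t_target=1.2, n_modes=2)
```

For a field that is smooth in time the extrapolated coefficients stay close to the true ones over a short gap; for longer gaps the extrapolation error grows, which is consistent with the restriction to sufficiently smooth solutions stated above.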

While the temporal estimation uses the previous flow field data and smoothness in time, the spatial
estimation uses geometrically neighboring data points at the current time to exploit smoothness
in space. In this project, a “multi-fidelity coKriging interpolation method”, an unbiased linear
interpolation, is introduced for estimating the missing part.
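
To illustrate what "unbiased linear interpolation" means, the sketch below implements single-fidelity ordinary kriging in one dimension. It is a simplified stand-in, not the multi-fidelity coKriging of the project: the estimate is a weighted sum of the observations, and a Lagrange multiplier forces the weights to sum to one. The function names, the Matérn 5/2 kernel, the length scale, and the nugget value are assumptions chosen for this example.

```python
import numpy as np

def matern52(a, b, length=0.3):
    """Matern kernel with smoothness 5/2 (assumed choice for this sketch)."""
    r = np.abs(a[:, None] - b[None, :]) / length
    return (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

def ordinary_kriging(x_obs, y_obs, x_new, kernel=matern52, nugget=1e-10):
    """Return kriging estimates at x_new and the weight matrix (n_obs, n_new)."""
    n = len(x_obs)
    # Augmented system: covariance block plus a row and column of ones that
    # enforce sum(weights) = 1, i.e. the unbiasedness constraint.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = kernel(x_obs, x_obs) + nugget * np.eye(n)
    A[n, n] = 0.0
    rhs = np.vstack([kernel(x_obs, x_new), np.ones((1, len(x_new)))])
    weights = np.linalg.solve(A, rhs)[:n]  # drop the Lagrange multiplier row
    return weights.T @ y_obs, weights

# Example: fill a gap between neighboring sample points.
x_obs = np.array([0.0, 0.2, 0.5, 0.7, 1.0])
y_obs = np.sin(2.0 * x_obs)
x_gap = np.linspace(0.25, 0.45, 5)
y_gap, w = ordinary_kriging(x_obs, y_obs, x_gap)
```

CoKriging extends this idea by adding a second, lower-fidelity data set and the cross-covariance between the two fidelity levels to the same kind of linear system.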

The “resimulation” method solves the Navier-Stokes equations again on the missing part only,
with an estimated initial condition (by coKriging) and estimated field variables at the boundary (by
projective integration); see Figure 1.

Figure 1: “Resimulation” with estimated boundary condition: First, we estimate the initial condition for the missing part (blue)
with two sample sets: refined (orange) and coarse (red). Subsequently, we use the projective integration to update the boundary
using the refined sample set. Finally, we solve the Navier-Stokes equations in the missing part only.


Fault-resilient  algorithm  -  Gappy  simulation 

In the gappy simulation framework, we explicitly compute the solution to a PDE not on the entire
domain but only on some sub-domains, together with auxiliary data that span the entire domain and
are obtained independently. The main idea is to combine the global coarse information with some finely
resolved sub-domains and to fuse the two solutions appropriately to obtain a more accurate solution on the
entire domain. This setup admits two different interpretations. From the multiscale perspective, the global
coarse solution represents the large scales whereas the fine-resolution sub-domains represent regions of
finer scales. The gappy regions may also be regions of finer scales but with spatial correlations determined
by the resolved regions. From the parallel computing perspective, the gappy sub-domains may be regions
corrupted by random software or hardware faults, whereas the global coarse solution is obtained on an
independent small set of processors, which is assumed to be immune to the faults that the big computer system
may suffer from.

A flow chart of the gappy simulation is shown in Figure 2. First, upon notification of a fault detection
(not discussed here), we check which domains are affected by errors and define the computational subdomains
and the gaps, respectively. Next, we choose a proper buffer size, and the gappy simulation estimates the fields at
the local boundaries of each subdomain by the information fusion method, also using the independent
auxiliary data. After setting up all the parameters and variables, the gappy simulation solves each subdomain on
independent nodes during the non-interaction time τ. After time τ, all subdomains are rejoined
and the buffer region of each subdomain is cut off. Finally, using the auxiliary data, the field variables
at the boundaries are updated via coKriging. The gappy simulation repeats this procedure until the
main simulation ends or all faults are fixed.
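
The cycle described above can be sketched for the one-dimensional heat equation with an explicit finite difference scheme. This is a toy illustration under several assumptions, not the project's code: fault detection is omitted, linear interpolation of the coarse auxiliary solution stands in for the coKriging information fusion, and all names (`heat_steps`, `gappy_cycle`, the grid sizes, the buffer width of 10 points) are hypothetical.

```python
import numpy as np

def heat_steps(u, r, n_steps):
    """Advance u_t = alpha*u_xx explicitly; r = alpha*dt/dx**2 (stable for r <= 0.5).
    The two end values are held fixed and act as boundary conditions."""
    u = np.array(u, dtype=float)  # work on a copy
    for _ in range(n_steps):
        u[1:-1] += r * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return u

def gappy_cycle(u, x, subdomains, buffer_pts, aux_x, aux_u_next, r, tau_steps):
    """One cycle: solve each subdomain plus buffer independently for tau_steps,
    cut off the buffer, and fill the gaps from the auxiliary data."""
    # Gap estimate at the new time level (stand-in for information fusion).
    new_u = np.interp(x, aux_x, aux_u_next)
    for lo, hi in subdomains:
        a, b = max(lo - buffer_pts, 0), min(hi + buffer_pts, len(x))
        # Local boundary values u[a] and u[b-1] stay fixed during the cycle.
        local = heat_steps(u[a:b], r, tau_steps)
        new_u[lo:hi] = local[lo - a:hi - a]  # keep the subdomain, drop the buffer
    return new_u

alpha, n = 1.0, 101
x = np.linspace(0.0, 1.0, n)
dx = x[1] - x[0]
r = 0.4
dt = r * dx**2 / alpha
aux_x = x[::10]                                  # coarse auxiliary grid
r_aux = alpha * dt / (aux_x[1] - aux_x[0])**2
tau_steps, n_cycles = 50, 20
subdomains = [(0, 40), (60, 101)]                # indices 40..59 form the gap

aux_u = np.sin(np.pi * aux_x)
u = np.sin(np.pi * x)
u[40:60] = np.interp(x[40:60], aux_x, aux_u)     # gap is known only coarsely
for _ in range(n_cycles):
    aux_u = heat_steps(aux_u, r_aux, tau_steps)  # independent coarse solve
    u = gappy_cycle(u, x, subdomains, 10, aux_x, aux_u, r, tau_steps)

t_final = n_cycles * tau_steps * dt
exact = np.exp(-np.pi**2 * alpha * t_final) * np.sin(np.pi * x)
rms_error = np.sqrt(np.mean((u - exact) ** 2))
```

In this sketch the buffer plays the role described in the text: errors introduced at the frozen local boundaries diffuse into the buffer during the non-interaction time and are discarded when the buffer is cut off.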

Main  Results 

The first main algorithmic result of this project is the reconstruction of missing data using three
different approaches according to three fault scenarios. These lead to a robust and effective recovered solution
in various fault scenarios. We have also developed the fault-resilient CFD algorithm in a unified parallel
computational framework. Combining approximation theory and domain decomposition with machine
learning techniques results in robustness and resilience with low-resolution auxiliary data. We
highlight some of the simulation results next.

Figure 2: A flow chart for a gappy simulation (start from top left): We first check where the gappy domains are located. Next,
we choose a buffer size and estimate the field variables at the local boundaries. Each sub-domain is solved in parallel and independently
during the non-interaction time τ. After τ, all gappy domains are rejoined after cutting off the buffer region. Finally, using the
information fusion method with auxiliary data, all field variables are updated at the local boundaries of the subdomains. This is
one complete cycle of the gappy simulation algorithm.


Fault-recovery  Simulations 

We present results for two benchmark problems: a lid-driven cavity flow (quasi-steady) and a flow past
a cylinder (quasi-periodic); for details, see (S. Lee et al. 2015). To this end, we consider three types of
fault scenarios: (1) a gappy region but with no previous gaps and no contamination of surrounding simulation
data, (2) a space-time gappy region but with full spatiotemporal information and no contamination, and
(3) previous gaps with contamination of surrounding data. To recover from such faults, we employ different
reconstruction and simulation methods, namely the projective integration, the coKriging interpolation, and
the resimulation method. The results with respect to RMS error and capability are shown in Tables 1 and 2.
We summarize here the main findings of our study:

•  For  sufficiently  small  time  gaps,  the  projective  integration  method  is  the  best  while  for  longer  time 
gaps  the  co-Kriging  method  is  better. 

•  Overall,  the  “resimulation”  method  seems  to  be  the  most  robust  method,  performing  well  in  all  three 
fault  scenarios. 

• Estimating the boundary condition using projective integration leads to accurate results for the
“resimulation” method in scenario 3, where the other two methods fail.


Fault-resilient  Simulations 

We apply our fault-resilient framework to the heat equation and the Navier-Stokes equations, and obtain
important first results via a parametric study. Specifically, we employ the finite difference method to perform

Table 1: Comparison of RMS error for three different methods in lid-driven cavity flow. "-" indicates that the method cannot
handle the corresponding scenario.

Velocity     Time gap (ΔT_g)   Scenario   P.I.     CoKriging   Resimulation
streamwise   0.5               1          0.0044   0.0136      0.0075
streamwise   0.5               2          -        0.0136      0.0074
streamwise   0.5               3          -        -           0.0078
streamwise   1.0               1          0.0156   0.0150      0.0124
streamwise   1.0               2          -        0.0150      0.0122
streamwise   1.0               3          -        -           0.0158
crossflow    0.5               1          0.0007   0.0177      0.0060
crossflow    0.5               2          -        0.0177      0.0059
crossflow    0.5               3          -        -           0.0088
crossflow    1.0               1          0.0116   0.0192      0.0108
crossflow    1.0               2          -        0.0192      0.0106
crossflow    1.0               3          -        -           0.0105

Table 2: Comparison of RMS error for three different methods in flow past a circular cylinder. "-" indicates that the method
cannot handle the corresponding scenario.

Velocity     Time gap (ΔT_g)   Scenario   P.I.     CoKriging   Resimulation
streamwise   0.27              1          0.0039   0.0219      0.0060
streamwise   0.27              2          -        0.0219      0.0172
streamwise   0.27              3          -        -           0.0175
streamwise   0.47              1          0.0193   0.0251      0.0144
streamwise   0.47              2          -        0.0251      0.0235
streamwise   0.47              3          -        -           0.0291
crossflow    0.27              1          0.0046   0.0178      0.0065
crossflow    0.27              2          -        0.0178      0.0168
crossflow    0.27              3          -        -           0.0189
crossflow    0.47              1          0.0231   0.0159      0.0149
crossflow    0.47              2          -        0.0159      0.0241
crossflow    0.47              3          -        -           0.0374

Figure 3: Time history of the RMS error in the heat equation with different parameters. Panels: (a) buffer and auxiliary data;
(b) non-interaction time (τ). In (a), the fixed parameter is τ=1. In (b), the fixed parameters are buffer=30% and
auxiliary data=6 x 6 grid.


a gappy simulation in both benchmark problems. The gappy domains look like a checkerboard (see Figure).
We observe that the RMS errors of all test simulations converge to zero at steady state. Moreover,
we investigate the key parameters of this framework: 1) type of correlation kernel, 2) size of buffer, 3)
accuracy of auxiliary data, and 4) non-interaction time, τ. The results of our parametric study are shown in
Figures 3 and 4. We summarize the main findings of our study below:

• Kernel: the Matérn kernel is found to be the best kernel with respect to RMS error and stability in
both problems.

• Buffer: a bigger buffer yields a smaller RMS error in both problems because the error at
the local boundary can diffuse within the buffer region. Moreover, when the auxiliary data are inaccurate or
unavailable, the buffer size becomes more important for the effectiveness of this framework.

• Auxiliary data: finer-resolution auxiliary data give a smaller RMS error in both problems
because the information fusion becomes more accurate. As shown in Figures 3 and 4, the
accuracy of the auxiliary data is found to be the most important parameter for reducing the RMS error
effectively.

• Non-interaction time (τ): In the heat equation (diffusion only), choosing τ near the allowable value, calculated
from an estimate of the penetration length for diffusion, yields the smaller RMS error. However,
in the Navier-Stokes equations (combined diffusion and advection), a smaller τ (updating the boundary
values more frequently) gives the smallest RMS error.

Figure 4: Time history of the RMS error in the Navier-Stokes equations with different parameters. Panels: (a) buffer;
(b) auxiliary data; (c) non-interaction time (τ); the horizontal axis is time. In (a), the fixed parameters are τ=5
and auxiliary data=8 x 8 grid. In (b), the fixed parameters are buffer=25% and τ=5. In (c), the fixed parameters are buffer=25%
and auxiliary data=8 x 8 grid.
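
The allowable non-interaction time for the heat equation, mentioned in the findings above, can be illustrated with the standard diffusion scaling, in which a boundary disturbance penetrates a distance of order sqrt(c·α·τ) after time τ. The report does not state the formula it used, so the sketch below is an assumption: the function names, the diffusivity symbol `alpha`, and the constant `c=4.0` are chosen for illustration only.

```python
import math

def penetration_length(alpha, tau, c=4.0):
    """Diffusive penetration length after time tau (scaling estimate;
    the constant c depends on the chosen definition and is assumed here)."""
    return math.sqrt(c * alpha * tau)

def allowable_tau(buffer_width, alpha, c=4.0):
    """Largest tau for which the penetration length stays within the buffer."""
    return buffer_width**2 / (c * alpha)

# Example: a buffer of width 0.1 with unit diffusivity.
tau_max = allowable_tau(buffer_width=0.1, alpha=1.0)
```

Under this scaling the allowable τ grows with the square of the buffer width, so doubling the buffer permits roughly four times as long between boundary updates in a purely diffusive problem. No comparable estimate is implied for the advection-dominated case.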